Attention Mechanism
Introduction
Before the invention of transformers, we were in the era of word embeddings. Word embeddings are fixed, static representations of words: they cannot capture the context in which a word appears. So, you may ask, how did models capture context back then? In those days, RNNs were the state-of-the-art models for NLP tasks. They capture context through their hidden state. But RNNs have several limitations:
- Both their forward and backward passes are sequential, which makes them very slow.
- They do not scale well: adding more RNN layers doesn’t necessarily make the model perform better.
Because of these limitations, researchers came up with the attention mechanism and the Transformer. Before explaining the transformer, we need to understand attention first, since it lies at the heart of the transformer.
Intuition before the math
Before diving into the matrices, let’s look at the why. The image above demonstrates self-attention: the mechanism that allows a model to understand a word based on the words surrounding it.
In the two sentences provided:
- The Problem: The word `bank` is a homonym. Without context, a computer doesn’t know if it’s a place to fish or a place to store money.
- The Heads: The colored arrows represent different Attention Heads. Just as humans look for different clues, the model has specific `heads` looking for different types of relationships:
  - Who head (Purple): Identifies the subject performing the action (Sherry vs. Ryan).
  - When head (Green): Looks for temporal or state-based context (is lying vs. needed to transfer).
  - Where head (Orange): Focuses on the physical or conceptual location.
- The Result: In the first sentence, the `Where` head strongly connects `bank` to `lying on`, signaling a geographical feature. In the second, it connects `bank` to `money` and `transfer`, signaling a financial institution.
The math below is simply the process of turning these logical connections into numerical weights.
Overview
Here’s how attention works, broken down step-by-step with live matrix values you can step through in the interactive demo.
The core formula is:
\[\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) \cdot V\]
Let’s walk through what each matrix does. Here’s the big picture:
\(X\) is your input — one row per token, with its embedding vector. From there, three separate linear projections create \(Q\), \(K\), and \(V\). Think of them as: \(Q\) (what am I looking for?), \(K\) (what do I offer to match against?), \(V\) (what content do I actually carry?).
The dot product \(Q \cdot K^T\) gives a score matrix where every token queries every other token simultaneously — this is what makes attention parallel rather than sequential. Dividing by \(\sqrt{d_k}\) is a stabilization trick: large embedding dimensions cause dot products to grow large, which pushes the softmax into saturated regions where gradients are near zero.
After softmax, each row is a probability distribution over all tokens — the actual attention weights. The final step \(A \cdot V\) is just a weighted average: each token’s output is a blend of all value vectors, weighted by how much it attended to each position.
In practice, this whole operation runs in parallel across multiple heads (each with its own \(W_q\), \(W_k\), \(W_v\)), which lets the model attend to different aspects of the sequence simultaneously.
Math Formulas
Here are all the attention mechanism formulas in order:
1. Input embeddings
\[X \in \mathbb{R}^{T \times d}\]
With T=5 tokens and d=3 dimensions.
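As a concrete sketch in NumPy (the embedding values below are random stand-ins for illustration, not trained embeddings):

```python
import numpy as np

T, d = 5, 3  # 5 tokens, 3 embedding dimensions
rng = np.random.default_rng(0)

# One row per token, one column per embedding dimension.
X = rng.normal(size=(T, d))
print(X.shape)  # (5, 3)
```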
2. Linear projections
\[Q = X W_Q, \quad K = X W_K, \quad V = X W_V\]
where \(W_Q, W_K, W_V \in \mathbb{R}^{d \times d_k}\), giving \(Q, K, V \in \mathbb{R}^{T \times d_k}\).
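The projections are plain matrix multiplications. Here is a minimal sketch, with random matrices standing in for the learned weights \(W_Q, W_K, W_V\):

```python
import numpy as np

T, d, d_k = 5, 3, 3
rng = np.random.default_rng(0)
X = rng.normal(size=(T, d))  # illustrative input embeddings

# Random stand-ins for the learned projection matrices.
W_Q = rng.normal(size=(d, d_k))
W_K = rng.normal(size=(d, d_k))
W_V = rng.normal(size=(d, d_k))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V
print(Q.shape, K.shape, V.shape)  # (5, 3) (5, 3) (5, 3)
```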
3. Raw attention scores
\[\text{Scores} = Q K^\top \in \mathbb{R}^{T \times T}\]
Entry \((i, j)\) measures how much token \(i\) attends to token \(j\):
\[\text{Scores}_{ij} = \sum_{k=1}^{d_k} Q_{ik} \cdot K_{jk}\]
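In code, the whole score matrix is one matrix product, and each entry is exactly the query–key dot product from the formula above (random \(Q\) and \(K\) here, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
Q = rng.normal(size=(5, 3))
K = rng.normal(size=(5, 3))

scores = Q @ K.T  # shape (T, T): every token scored against every token

# Entry (i, j) is the dot product of query i with key j.
i, j = 1, 4
assert np.isclose(scores[i, j], np.dot(Q[i], K[j]))
```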
4. Scaled scores
\[\text{Scaled} = \frac{QK^\top}{\sqrt{d_k}}\]
With \(d_k = 3\): \(\sqrt{d_k} = \sqrt{3} \approx 1.732\).
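Why \(\sqrt{d_k}\) specifically? For unit-variance entries, a dot product of two \(d_k\)-dimensional vectors has standard deviation about \(\sqrt{d_k}\), so dividing by \(\sqrt{d_k}\) brings the scores back to roughly unit scale. A quick empirical check (with a larger, made-up \(d_k\) so the effect is visible):

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 512
q = rng.normal(size=(10_000, d_k))
k = rng.normal(size=(10_000, d_k))

raw = np.sum(q * k, axis=1)        # 10,000 raw dot products
print(raw.std())                   # ~ sqrt(512), i.e. scores blow up with d_k
print((raw / np.sqrt(d_k)).std())  # ~ 1 after scaling
```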
5. Softmax (per row)
\[A_{ij} = \left[\text{softmax}\!\left(\text{Scaled}_{i}\right)\right]_j = \frac{\exp\!\left(\text{Scaled}_{ij}\right)}{\sum_{j'} \exp\!\left(\text{Scaled}_{ij'}\right)}\]
giving attention weight matrix \(A \in \mathbb{R}^{T \times T}\), where each row sums to 1:
\[\sum_{j=1}^{T} A_{ij} = 1 \quad \forall\, i\]
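A row-wise softmax can be sketched directly (subtracting the row maximum first is a standard numerical-stability trick; it doesn’t change the result):

```python
import numpy as np

def softmax_rows(scores):
    # Shift each row by its max before exponentiating to avoid overflow.
    z = scores - scores.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
scaled = rng.normal(size=(5, 5))   # stand-in for the scaled scores
A = softmax_rows(scaled)
print(A.sum(axis=1))               # each row sums to 1
```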
6. Output
\[\text{Output} = A V \in \mathbb{R}^{T \times d_k}\]
Each output row is a weighted sum of value vectors:
\[\text{Output}_i = \sum_{j=1}^{T} A_{ij} \cdot V_j\]
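The matrix product \(AV\) and the explicit weighted sum are the same computation, which we can verify with made-up weights and values:

```python
import numpy as np

rng = np.random.default_rng(0)

# A: valid attention weights (rows sum to 1), V: illustrative values.
A = rng.random((5, 5))
A /= A.sum(axis=1, keepdims=True)
V = rng.normal(size=(5, 3))

out = A @ V  # shape (T, d_k)

# Row 0 of the output is the attention-weighted average of all value rows.
manual = sum(A[0, j] * V[j] for j in range(5))
assert np.allclose(out[0], manual)
```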
Full formula (single expression)
\[\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right) V\]
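Putting the steps together, the full expression fits in one short function (a minimal single-head sketch, without masking or batching):

```python
import numpy as np

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V — scores, scale, softmax, weighted sum."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    A = np.exp(scores)
    A /= A.sum(axis=-1, keepdims=True)            # rows sum to 1
    return A @ V

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(5, 3)) for _ in range(3))
out = attention(Q, K, V)
print(out.shape)  # (5, 3)
```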
Multi-head attention
With \(h\) heads, each head \(i\) has its own projections \(W_Q^{(i)}, W_K^{(i)}, W_V^{(i)} \in \mathbb{R}^{d \times d_k}\):
\[\text{head}_i = \text{Attention}(X W_Q^{(i)},\ X W_K^{(i)},\ X W_V^{(i)})\]
The heads are concatenated and projected back:
\[\text{MultiHead}(X) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)\, W_O\]
where \(W_O \in \mathbb{R}^{h d_k \times d}\) is the output projection.
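The multi-head version just runs the single-head computation once per head and stitches the results back together. A minimal sketch, again with random stand-ins for all learned weights:

```python
import numpy as np

def softmax_rows(s):
    z = np.exp(s - s.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

def multi_head(X, heads, W_O):
    """heads: list of (W_Q, W_K, W_V) tuples, one per head."""
    outs = []
    for W_Q, W_K, W_V in heads:
        Q, K, V = X @ W_Q, X @ W_K, X @ W_V
        A = softmax_rows(Q @ K.T / np.sqrt(Q.shape[-1]))
        outs.append(A @ V)                        # each head: (T, d_k)
    # Concatenate to (T, h*d_k), then project back to (T, d).
    return np.concatenate(outs, axis=-1) @ W_O

rng = np.random.default_rng(0)
T, d, d_k, h = 5, 3, 3, 2
X = rng.normal(size=(T, d))
heads = [tuple(rng.normal(size=(d, d_k)) for _ in range(3)) for _ in range(h)]
W_O = rng.normal(size=(h * d_k, d))
print(multi_head(X, heads, W_O).shape)  # (5, 3)
```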